Outline 0 Introduction In this paper…… An overview of the purpose of your analysis including details or references necessary.
1 Overview 1.1 Business Problem 1.2 Business Value Proposition
2 Data Description ◦the source ◦the variables with definitions or a link to a codebook ◦the number of observations in your data set ◦detail on missingness ◦a glimpse of your data if possible (e.g. head and tail)
3 Data Preprocessing ◦Feature generation ◦Imputation ◦Cleaning or merging of categories ◦Outlier removal ◦Anything that changes your data from the original form
4.Your Final Analysis in small pieces with annotation 5.Graphs to visualize different steps in your analysis 6.Clear discussion of why you made analysis choices 7.References to papers or citations you used to make decisions about the analysis
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: https://www.vinhoverde.pt/en/
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal, 2009
Download red wine dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Download white wine dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Download available at: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality.names
wineRed = read.csv("winequality-red.csv",sep = ";")
wineWhite = read.csv("winequality-white.csv",sep = ";")
head(wineRed)
str(wineRed)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
head(wineWhite)
str(wineWhite)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
library(tidyverse)
wineRed %>% is.na() %>% colSums()
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
wineWhite %>% is.na() %>% colSums()
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
There are no missing data present in both red and white wine datasets
wineRed$wine.type <- 1
wineWhite$wine.type <- 0
dim(wineRed)
## [1] 1599 13
dim(wineWhite)
## [1] 4898 13
wine = rbind(wineRed, wineWhite)
wine$wine.type <- as.factor(wine$wine.type)
dim(wine)
## [1] 6497 13
| rows | 1599 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
| total_observations | 20787 |
| memory_usage | 163576 |
| rows | 4898 |
| columns | 13 |
| discrete_columns | 0 |
| continuous_columns | 13 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 4898 |
| total_observations | 63674 |
| memory_usage | 493472 |
| rows | 6497 |
| columns | 13 |
| discrete_columns | 1 |
| continuous_columns | 12 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 6497 |
| total_observations | 84461 |
| memory_usage | 627904 |
head(wine)
library(DataExplorer)
plot_histogram(wine)
library(DataExplorer)
plot_bar(wine)
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality wine.type
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000 0:4898
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000 1:1599
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
str(wine)
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ wine.type : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
library(DataExplorer)
plot_correlation(wine, type = "c")
Some of the predictors have strong correlations: * Between density and alcohol (-0.69) * Between free.sulfur.dioxide and total.sulfur.dioxide (0.72)
set.seed(13)
trainIndex = sample(1:nrow(wine), size = round(0.75*nrow(wine)), replace=FALSE)
train<-wine[trainIndex, ]
valid<-wine[-trainIndex, ]
nrow(train)
## [1] 4873
nrow(valid)
## [1] 1624
library(MASS)
# Fit the full model
full <- lm(quality ~., data = train)
# Stepwise regression model
step <- stepAIC(full, direction = "both", trace = FALSE)
summary(step)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + total.sulfur.dioxide + density + pH +
## sulphates + alcohol + wine.type, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5440 -0.4690 -0.0480 0.4567 3.0005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.178e+02 1.798e+01 6.553 6.23e-11 ***
## fixed.acidity 1.027e-01 1.895e-02 5.419 6.28e-08 ***
## volatile.acidity -1.530e+00 8.799e-02 -17.392 < 2e-16 ***
## residual.sugar 6.846e-02 7.206e-03 9.501 < 2e-16 ***
## free.sulfur.dioxide 3.898e-03 8.880e-04 4.390 1.16e-05 ***
## total.sulfur.dioxide -1.261e-03 3.756e-04 -3.356 0.000796 ***
## density -1.179e+02 1.825e+01 -6.457 1.17e-10 ***
## pH 6.013e-01 1.066e-01 5.642 1.78e-08 ***
## sulphates 6.918e-01 8.790e-02 7.871 4.32e-15 ***
## alcohol 2.169e-01 2.290e-02 9.469 < 2e-16 ***
## wine.type1 3.660e-01 6.754e-02 5.418 6.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7342 on 4862 degrees of freedom
## Multiple R-squared: 0.2991, Adjusted R-squared: 0.2977
## F-statistic: 207.5 on 10 and 4862 DF, p-value: < 2.2e-16
p.valid<-predict(step, newdata=valid)
head(p.valid)
## 5 7 13 15 17 20
## 4.928785 5.066270 5.376766 5.228540 6.034067 5.690349
library(caret)
RMSE(p.valid, valid$quality)
## [1] 0.7319179
R2(p.valid, valid$quality)
## [1] 0.2857878
pressure
paste("The mean pressure is:", round(mean(pressure$pressure),3), "mm")
[1] “The mean pressure is: 124.337 mm”
library(knitr)
kable(head(pressure), format="pipe", digit=3)
| temperature | pressure |
|---|---|
| 0 | 0.000 |
| 20 | 0.001 |
| 40 | 0.006 |
| 60 | 0.030 |
| 80 | 0.090 |
| 100 | 0.270 |
kable(tail(pressure), format="pipe", digit=3)
| temperature | pressure | |
|---|---|---|
| 14 | 260 | 96 |
| 15 | 280 | 157 |
| 16 | 300 | 247 |
| 17 | 320 | 376 |
| 18 | 340 | 558 |
| 19 | 360 | 806 |
You can also embed plots, for example:
For more details on organizing with tabset go here https://bookdown.org/yihui/rmarkdown-cookbook/html-tabs.html.
You can include both inline and offset equations.
You can include inline equations like \(y = nx + b\), you can also do more complicated inline equations such as \(\hat{y} = \hat{\beta} + \hat{\beta_1}x\)